retry transient tcp error by jgao54 · Pull Request #4174 · PeerDB-io/peerdb

jgao54 · 2026-04-14T08:51:24Z

This PR fixes two things:

1. Handle error code: 1001, message: std::__1::ios_base::failure: ios_base::clear: unspecified iostream_category error. This is a transient networking error that can happen in S3 during reads and usually recovers after a single retry. We also want to categorize it as retryable so that a single error does not restart the snapshot from scratch.
2. When checking for the VM substring, use error.Message() so that the substring search also covers all the wrapped error messages. Currently such errors are classified as Normalization Error, which also notifies, but ViewError is more direct here.

codecov · 2026-04-14T09:09:08Z

❌ 2 Tests Failed:

Tests completed	Failed	Passed	Skipped
2185	2	2183	196

View the top 3 failed test(s) by shortest run time

github.com/PeerDB-io/peerdb/flow/e2e::TestApiMy

Stack Traces | 0s run time

=== RUN   TestApiMy
=== PAUSE TestApiMy
=== CONT  TestApiMy
--- FAIL: TestApiMy (0.00s)

github.com/PeerDB-io/peerdb/flow/e2e::TestGenericBQ

Stack Traces | 0s run time

=== RUN   TestGenericBQ
=== PAUSE TestGenericBQ
=== CONT  TestGenericBQ
--- FAIL: TestGenericBQ (0.00s)

github.com/PeerDB-io/peerdb/flow/e2e::TestApiMongo

Stack Traces | 0.01s run time

=== RUN   TestApiMongo
=== PAUSE TestApiMongo
=== CONT  TestApiMongo
--- FAIL: TestApiMongo (0.01s)
2026/04/14 20:36:38 INFO Executing and processing query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id"
2026/04/14 20:36:38 INFO Executing and processing query stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id"
2026/04/14 20:36:38 INFO [pg_query_executor] declared cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursorQuery="DECLARE peerdb_cursor_16246957176269917882 CURSOR FOR SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id" args=[]
2026/04/14 20:36:38 INFO [pg_query_executor] fetching rows start x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id" channelLen=0
2026/04/14 20:36:38 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_16246957176269917882
2026/04/14 20:36:38 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_16246957176269917882 records=2 bytes=19 channelLen=1
2026/04/14 20:36:38 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id" rows=2 bytes=19 channelLen=1
2026/04/14 20:36:38 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_16246957176269917882
2026/04/14 20:36:38 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_16246957176269917882 records=0 bytes=0 channelLen=0
2026/04/14 20:36:38 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id" rows=0 bytes=0 channelLen=0
2026/04/14 20:36:38 INFO [pg_query_executor] committing transaction x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart
2026/04/14 20:36:38 INFO [pg_query_executor] committed transaction for query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id" rows=2 bytes=19 channelLen=0

github.com/PeerDB-io/peerdb/flow/e2e::TestApiMy/TestCancelTableAdditionRemoveAddRemove

Stack Traces | 22.1s run time

=== RUN   TestApiMy/TestCancelTableAdditionRemoveAddRemove
=== PAUSE TestApiMy/TestCancelTableAdditionRemoveAddRemove
=== CONT  TestApiMy/TestCancelTableAdditionRemoveAddRemove
2026/04/14 20:30:02 INFO Received AWS credentials from peer for connector: ci x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/04/14 20:30:02 INFO Received AWS credentials from peer for connector: clickhouse x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/04/14 20:30:02 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_mych_gxfbxtd4.test_exclude_ch
    cancel_table_addition_test.go:637: WaitFor wait for initial load to finish 2026-04-14 20:30:08.448115732 +0000 UTC m=+259.548528067
    cancel_table_addition_test.go:641: WaitFor t1 2026-04-14 20:30:08.448464086 +0000 UTC m=+259.548876431
2026/04/14 20:30:08 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_api_hlqe9ibs.t1
    cancel_table_addition_test.go:642: WaitFor t2 2026-04-14 20:30:08.460484694 +0000 UTC m=+259.560897039
    cancel_table_addition_test.go:82: WaitFor wait for pause for remove e2e_test_api_hlqe9ibs.t2 2026-04-14 20:30:08.474570456 +0000 UTC m=+259.574982801
2026/04/14 20:30:08 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_mychclg_0tir1unq.test_simple_schema_changes
    cancel_table_addition_test.go:83: UNEXPECTED ERROR unable to establish connection with catalog: FATAL: terminating connection due to administrator command (SQLSTATE 57P01)
    api_test.go:48: begin tearing down postgres schema api_hlqe9ibs
--- FAIL: TestApiMy/TestCancelTableAdditionRemoveAddRemove (22.13s)

github.com/PeerDB-io/peerdb/flow/e2e::TestApiMongo/TestCancelTableAdditionRemoveAddRemove

Stack Traces | 27.1s run time

=== RUN   TestApiMongo/TestCancelTableAdditionRemoveAddRemove
=== PAUSE TestApiMongo/TestCancelTableAdditionRemoveAddRemove
=== CONT  TestApiMongo/TestCancelTableAdditionRemoveAddRemove
2026/04/14 20:35:21 INFO Received AWS credentials from peer for connector: ci x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/04/14 20:35:21 INFO Received AWS credentials from peer for connector: clickhouse x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/04/14 20:35:21 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_api_c3ysbutl.t1
    cancel_table_addition_test.go:637: WaitFor wait for initial load to finish 2026-04-14 20:35:27.473358265 +0000 UTC m=+596.561897192
    cancel_table_addition_test.go:641: WaitFor t1 2026-04-14 20:35:27.473761152 +0000 UTC m=+596.562300084
    cancel_table_addition_test.go:642: WaitFor t2 2026-04-14 20:35:27.478017915 +0000 UTC m=+596.566556855
    cancel_table_addition_test.go:82: WaitFor wait for pause for remove e2e_test_api_yvpyaayk.t2 2026-04-14 20:35:27.484166648 +0000 UTC m=+596.572705574
2026/04/14 20:35:27 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_api_c3ysbutl.t1
    cancel_table_addition_test.go:100: WaitFor wait for table removal of source_table_identifier:"e2e_test_api_yvpyaayk.t2" destination_table_identifier:"t2" to finish 2026-04-14 20:35:43.509681661 +0000 UTC m=+612.598220600
2026/04/14 20:35:43 INFO [pg_query_executor] declared cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursorQuery="DECLARE peerdb_cursor_1611523372885743617 CURSOR FOR SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" args=[]
2026/04/14 20:35:43 INFO [pg_query_executor] fetching rows start x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" channelLen=0
2026/04/14 20:35:43 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_1611523372885743617
2026/04/14 20:35:43 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_1611523372885743617 records=2 bytes=19 channelLen=1
2026/04/14 20:35:43 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" rows=2 bytes=19 channelLen=1
2026/04/14 20:35:43 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_1611523372885743617
2026/04/14 20:35:43 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_1611523372885743617 records=0 bytes=0 channelLen=0
2026/04/14 20:35:43 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" rows=0 bytes=0 channelLen=0
2026/04/14 20:35:43 INFO [pg_query_executor] committing transaction x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart
2026/04/14 20:35:43 INFO [pg_query_executor] committed transaction for query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" rows=2 bytes=19 channelLen=0
    cancel_table_addition_test.go:127: WaitFor wait for pause for add e2e_test_api_yvpyaayk.t2 2026-04-14 20:35:44.51465261 +0000 UTC m=+613.603191542
2026/04/14 20:35:44 INFO Executing and processing query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id"
2026/04/14 20:35:44 INFO Executing and processing query stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id"
2026/04/14 20:35:44 INFO [pg_query_executor] declared cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursorQuery="DECLARE peerdb_cursor_12950402609540204079 CURSOR FOR SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" args=[]
2026/04/14 20:35:44 INFO [pg_query_executor] fetching rows start x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" channelLen=0
2026/04/14 20:35:44 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_12950402609540204079
2026/04/14 20:35:44 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_12950402609540204079 records=2 bytes=19 channelLen=1
2026/04/14 20:35:44 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" rows=2 bytes=19 channelLen=1
2026/04/14 20:35:44 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_12950402609540204079
2026/04/14 20:35:44 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_12950402609540204079 records=0 bytes=0 channelLen=0
2026/04/14 20:35:44 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" rows=0 bytes=0 channelLen=0
2026/04/14 20:35:44 INFO [pg_query_executor] committing transaction x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart
2026/04/14 20:35:44 INFO [pg_query_executor] committed transaction for query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" rows=2 bytes=19 channelLen=0
    cancel_table_addition_test.go:128: UNEXPECTED ERROR unable to establish connection with catalog: FATAL: terminating connection due to administrator command (SQLSTATE 57P01)
    api_test.go:48: begin tearing down postgres schema api_yvpyaayk
--- FAIL: TestApiMongo/TestCancelTableAdditionRemoveAddRemove (27.12s)

github.com/PeerDB-io/peerdb/flow/e2e::TestGenericBQ/Test_Simple_Flow

Stack Traces | 33.6s run time

=== RUN   TestGenericBQ/Test_Simple_Flow
=== PAUSE TestGenericBQ/Test_Simple_Flow
=== CONT  TestGenericBQ/Test_Simple_Flow
    generic_test.go:124: UNEXPECTED STATUS TIMEOUT STATUS_SNAPSHOT
    bigquery.go:86: begin tearing down postgres schema bq_ipllifnk_20260414204626
--- FAIL: TestGenericBQ/Test_Simple_Flow (33.63s)


github-actions · 2026-04-14T09:11:48Z

🔄 Flaky Test Detected

Analysis: Both failures are caused by SQLSTATE 57P01 (PostgreSQL admin shutdown), a transient infrastructure event where the database connection was terminated externally — not a code logic bug.
Confidence: 0.95

✅ Automatically retrying the workflow


github-actions · 2026-04-14T19:00:14Z

🔄 Flaky Test Detected

Analysis: Both failures stem from PostgreSQL connection termination (SQLSTATE 57P01 — "terminating connection due to administrator command") and an active replication slot blocking teardown (SQLSTATE 55006), which are transient infrastructure/resource-contention issues in the concurrent e2e test environment, not code logic bugs.
Confidence: 0.92

✅ Automatically retrying the workflow


github-actions · 2026-04-14T19:19:07Z

🔄 Flaky Test Detected

Analysis: Both failures are transient: one is a snapshot status timeout (async timing issue) and the other is a PostgreSQL connection terminated by admin shutdown (SQLSTATE 57P01), neither indicating a code regression.
Confidence: 0.93

✅ Automatically retrying the workflow


ilidemi · 2026-04-14T20:14:24Z


+// conditionallyRetryableExceptions are error codes that are only retryable
+// when the error message contains one of the specified substrings
+var conditionallyRetryableExceptions = map[chproto.Error][]string{


Just an idea, how about retryableExceptionSubstrings? Wouldn't need an explanation this way
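For context, a rough sketch of how a map like this might be consulted, using the name suggested above. The `errorCode` type is a local stand-in for `chproto.Error` so the example runs on its own, and `isConditionallyRetryable` is a hypothetical helper, not the function from the PR. The code 1001 and the substring come from the PR description.

```go
package main

import (
	"fmt"
	"strings"
)

// errorCode is a local stand-in for the chproto.Error type used in the diff.
type errorCode int32

// errCode1001 is the error code quoted in the PR description.
const errCode1001 errorCode = 1001

// retryableExceptionSubstrings maps an error code to the message substrings
// that make it retryable. A listed code is retryable only when the message
// contains at least one of its substrings.
var retryableExceptionSubstrings = map[errorCode][]string{
	errCode1001: {"unspecified iostream_category error"},
}

// isConditionallyRetryable is a hypothetical helper: it reports whether the
// code is listed and the message contains one of the listed substrings.
func isConditionallyRetryable(code errorCode, message string) bool {
	for _, substr := range retryableExceptionSubstrings[code] {
		if strings.Contains(message, substr) {
			return true
		}
	}
	return false
}

func main() {
	msg := "std::__1::ios_base::failure: ios_base::clear: unspecified iostream_category error"
	// Listed code with a matching message.
	fmt.Println(isConditionallyRetryable(errCode1001, msg))
	// Listed code, but the message does not match any substring.
	fmt.Println(isConditionallyRetryable(errCode1001, "some other failure"))
	// Arbitrary code that is not in the map.
	fmt.Println(isConditionallyRetryable(errorCode(42), msg))
}
```

Indexing a missing key yields a nil slice, so unlisted codes fall through to `false` without a separate existence check.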

github-actions · 2026-04-14T20:42:28Z

🔄 Flaky Test Detected

Analysis: The test failed due to a transient PostgreSQL connection drop (SQLSTATE 57P01 — admin_shutdown), where the catalog DB connection was terminated by an administrator command mid-test, indicating a CI infrastructure issue rather than a code bug.
Confidence: 0.95

✅ Automatically retrying the workflow


github-actions · 2026-04-14T21:00:12Z

🔄 Flaky Test Detected

Analysis: TestGenericBQ/Test_Simple_Flow timed out waiting for STATUS_SNAPSHOT to complete — a classic transient timeout in an e2e test that depends on BigQuery and external services, with only 2 failures out of 2348 tests and the codebase itself flagging BigQuery tests as flaky under high concurrency.
Confidence: 0.92

✅ Automatically retrying the workflow


retry error

e43ff48

jgao54 requested review from ilidemi and masterashu April 14, 2026 08:51

masterashu approved these changes Apr 14, 2026


jgao54 commented Apr 14, 2026


Comment thread flow/otel_metrics/otel_manager.go Outdated

nit

c7f7ba7

ilidemi approved these changes Apr 14, 2026


review

4b2284e

jgao54 enabled auto-merge (squash) April 14, 2026 20:18

jgao54 merged commit dbfbd2d into main Apr 14, 2026
17 of 20 checks passed

jgao54 deleted the normalize-err branch April 14, 2026 21:16

jgao54 merged 3 commits into main from normalize-err
